Grammar-Based Recognition of Documentary Forms and Extraction of Metadata
Author: William Underwood
Abstract
Metadata extraction is a critical aspect of ingesting collections into digital archives and libraries. A method for automatically recognizing document types and extracting metadata from digital records has been developed. It builds on an earlier method for automatically annotating semantic categories, such as persons’ names, job titles, dates, and postal addresses, that may occur in a record. It extends that method by using the semantic annotations to identify the intellectual elements of a document’s form, parsing these elements using context-free grammars that define documentary forms, and interpreting the elements of the form to identify metadata such as the chronological date, author(s), addressee(s), and topic. Context-free grammars were developed for fourteen of the documentary forms occurring in Presidential records. In an experiment, the document type recognizer successfully recognized the documentary form and extracted the metadata of two-thirds of the records in a series of Presidential e-records containing twenty-one document types.

1. This paper is based on the paper given by the author at the 5th International Digital Curation Conference, December 2009; received November 2009, published June 2010.

The International Journal of Digital Curation is an international journal committed to scholarly excellence and dedicated to the advancement of digital curation across a wide range of sectors. ISSN: 1746-8256. The IJDC is published by UKOLN at the University of Bath and is a publication of the Digital Curation Centre.

Introduction

The increasing volume of digital records being acquired by archives and libraries poses significant challenges to archivists’ manual procedures for processing records. Archivists traditionally describe records at record group (or collection), series, file unit and item levels.
This gives the archive intellectual control over its holdings and supports access to the records. Archival descriptions (summaries or metadata) include the names of the types of records that occur in a record series, for example, correspondence, memoranda or agenda. Record descriptions also include authors’ and addressees’ names as well as the topics of records. Archivists cannot completely describe a collection until it has been manually read and reviewed. With increasing volumes of electronic records, it may be decades or even centuries before new acquisitions are described. An automated method of metadata extraction and description is needed.

The next section of this paper reviews the concept of documentary form and related concepts. Related research in document type (or genre) identification is then summarized. Next, the method for recognizing documentary forms and extracting document metadata is described, followed by an implementation and experimental evaluation of the method. Finally, the results of the research are summarized with a discussion of open research issues.

Documentary Form, Record Types and Document Type

The International Council on Archives (1999), in its standard for archival description, defines a (documentary) form as “A class of documents distinguished on the basis of common physical (e.g. water colour, drawing) and/or intellectual (e.g. diary, journal, day book, minute book) characteristics of a document”. The standard also specifies that the names of forms be used in describing record series and titling records. The National Archives and Records Administration’s guideline for cataloging archival materials defines specific records type as “the intellectual format of the archival materials” (NARA, 2008). The purpose of the specific records type is that it “Enables users to search for archival materials by the types of document represented in the archival materials”.
The guidelines also specify that specific records types be used in describing record series.

The science of diplomatics defines documentary form as “the rules of representation used to convey a message, that is, the characteristics of a document which can be separated from the determination of the particular subjects, or places it concerns. Documentary form is both physical and intellectual” (Duranti, 1998). The intellectual form of a document is “the sum of a record's formal attributes that represent and communicate the elements of the action in which the record is involved and of its immediate context, both documentary and administrative”. The physical form of a document is “the overall appearance, configuration, or shape, derived from its material characteristics and independent of its intellectual content” (Duranti, 1998).

The International Journal of Digital Curation Issue 1, Volume 5 | 2010, William Underwood

The Standard Generalized Markup Language (SGML) uses a Document Type Definition (DTD) to define document form (International Standards Organization, 1986). A DTD specifies a set of elements, their relationships, and the tag set used to mark up the document. The Extensible Markup Language (XML) is a simpler subset of SGML (World Wide Web Consortium, 2006). The concept of document structure as defined by an XML DTD is a formal model of the concept of the intellectual form of a document.

The concept of genre is similar to that of documentary form, but it includes classes of documents that are characterized not by their intellectual or physical form but by pragmatic or rhetorical features. Examples of written genres include academic prose, biography, instructional material and newspaper reports. See Santini (2004b) for a discussion.

Figure 1 shows examples of the names of some of the specific documentary forms (record types) discovered in Presidential e-records.

Figure 1. Documentary Forms in Presidential Records.
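The claim that a DTD formally models the intellectual form of a document can be illustrated with a small sketch. In a DTD, an element's content model is essentially a regular expression over the names of its child elements. The Python below assumes a hypothetical memorandum content model (it is not taken from the paper or from any Presidential records schema) and checks whether a sequence of element names conforms to it.

```python
import re

# Hypothetical content model for a memorandum, in the spirit of the DTD
# declaration <!ELEMENT memo (date, from, to+, cc*, subject, body)>.
# The element names are illustrative assumptions, not the paper's.
MEMO_MODEL = "date from to+ cc* subject body"


def model_to_regex(model: str) -> re.Pattern:
    """Compile a space-separated content model into a regex over element names."""
    parts = []
    for item in model.split():
        # Split off an optional occurrence indicator (+, * or ?).
        if item[-1] in "+*?":
            name, suffix = item[:-1], item[-1]
        else:
            name, suffix = item, ""
        # Each element name is followed by a space that acts as a separator.
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))


def conforms(elements: list[str], model: str) -> bool:
    """Return True if the element sequence matches the content model exactly."""
    sequence = "".join(name + " " for name in elements)
    return model_to_regex(model).fullmatch(sequence) is not None
```

With this model, `conforms(["date", "from", "to", "to", "subject", "body"], MEMO_MODEL)` holds, while a sequence that puts `from` before `date` is rejected. Content models of this kind are regular; nested or recursive structure needs the extra power of a context-free grammar.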
Related Research

The reader is referred to Santini (2004b) for a survey of state-of-the-art approaches to genre identification of digital documents. Santini (2004a) also describes a method based on part-of-speech trigrams for classifying ten genres, including conversations, interviews, public debate, biography and reportage.

The objective of the research of Kim and Ross (2007a, 2007b) is the recognition of genre for the purpose of metadata extraction from digital records ingested into digital archives or libraries. Their approach is to identify features of documents that will allow them to automatically classify documents by genre. The features they have identified include image features, syntactic features, stylistic features, semantic structure, and domain knowledge features. These features are used with an image classifier, an n-gram model classifier and a stylometric classifier.

Our research differs from that of Kim and Ross, and of other researchers in genre identification, in that our objective is to recognize a document’s form by parsing its intellectual elements using grammars characterizing document types. However, there are document types for which it is necessary to use pragmatic features to recognize the genre, for example, white papers and biographies.

A Method for Recognizing Documentary Forms and Extracting Document Metadata

Legacy and current Presidential e-records are not XML documents, but e-records in proprietary file formats. However, it will be shown that it is possible to define, recognize and annotate the intellectual elements of a textual e-record, and that the structure of the intellectual elements of a particular documentary form can be defined with rules similar to those of an XML document type definition.
This will enable the recognition of documentary forms and extraction of document metadata.

The process of automatically recognizing the document types of documents in proprietary file formats is outlined in Figure 2. The italicized phrases to the right of the downward-pointing arrows indicate inputs and outputs of the numbered processing steps (Underwood & Laib, 2008).

Figure 2. The Process of Document Type Recognition and Metadata Extraction.

The first through sixth steps constitute a previously implemented method for automatically annotating semantic categories in text, such as persons’ names, job titles, dates, location names, postal addresses and organization names (Underwood & Isbell, 2008). The input to the method is an e-record in a proprietary file format. The first step converts that record to a plain text or HTML file format. The third step, Wordlist Lookup, matches the terms (tokens) in the document against approximately 170,000 terms in 181 wordlists for such classes as person first names, surnames, city names, country names, months, and organizational nouns. If there is a match, the text is annotated with a tag for the name of that class. The sixth step, Semantic Tagger, applies rules to the previously annotated text to produce additional annotations, for example, persons’ full names and locations made up of city and state or country names.

Figure 3 shows a document whose paragraphs, dates, times, and person, location and organization names have been annotated by the first six steps of the method.

Figure 3. Document with Annotated Paragraphs and Semantic Categories.

The seventh step, Intellectual Element Annotator, recognizes and annotates the intellectual elements occurring in a document. Currently, there are about 100 intellectual element rules.
They are applied to the annotated document and identify text strings such as FROM:, SUBJECT:, and Attachment, or previously annotated semantic categories such as date, address and person’s name, as intellectual elements. Figure 4 shows the document in Figure 3 after the annotation of the intellectual elements.
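The annotation and parsing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation. The four regular-expression rules stand in for the roughly 100 intellectual element rules and for the semantic annotation of steps one through six. The memorandum grammar, the element names (`date`, `from`, `to`, `subject`, `para`) and the function names are all hypothetical.

```python
import re
from functools import lru_cache

# Hypothetical intellectual element rules: (pattern, element name).
# A real system would apply such rules to semantically annotated text.
ELEMENT_RULES = [
    (re.compile(r"^(?:DATE:\s*)?(\w+ \d{1,2}, \d{4})$"), "date"),
    (re.compile(r"^FROM:\s*(.+)$"), "from"),
    (re.compile(r"^TO:\s*(.+)$"), "to"),
    (re.compile(r"^SUBJECT:\s*(.+)$"), "subject"),
]

# Hypothetical context-free grammar for a memorandum. Keys are
# nonterminals; any symbol that is not a key is a terminal element name.
GRAMMAR = {
    "Memo": [["Header", "Body"]],
    "Header": [["date", "Parties", "subject"], ["Parties", "date", "subject"]],
    "Parties": [["from", "Tos"], ["Tos", "from"]],
    "Tos": [["to", "Tos"], ["to"]],
    "Body": [["para", "Body"], ["para"]],
}


def annotate(lines):
    """Label each non-empty line with an intellectual element name and value."""
    elements = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for pattern, name in ELEMENT_RULES:
            match = pattern.match(line)
            if match:
                elements.append((name, match.group(1)))
                break
        else:
            # No rule fired: treat the line as an ordinary paragraph.
            elements.append(("para", line))
    return elements


def recognizes(grammar, start, tags):
    """Top-down recognizer with memoization; the grammar must not be left-recursive."""
    tags = tuple(tags)

    @lru_cache(maxsize=None)
    def ends(symbol, i):
        # Return every position at which `symbol` can end when it starts at i.
        if symbol not in grammar:  # terminal symbol
            matched = i < len(tags) and tags[i] == symbol
            return frozenset({i + 1}) if matched else frozenset()
        result = set()
        for production in grammar[symbol]:
            positions = {i}
            for part in production:
                positions = {j for p in positions for j in ends(part, p)}
            result |= positions
        return frozenset(result)

    return len(tags) in ends(start, 0)


def extract_metadata(text):
    """Return memo metadata if the text parses as a memorandum, else None."""
    elements = annotate(text.splitlines())
    if not recognizes(GRAMMAR, "Memo", [name for name, _ in elements]):
        return None
    metadata = {"form": "memorandum", "addressees": []}
    for name, value in elements:
        if name == "to":
            metadata["addressees"].append(value)
        elif name == "from":
            metadata["author"] = value
        elif name in ("date", "subject"):
            metadata[name] = value
    return metadata


SAMPLE = """\
March 3, 1990
FROM: Jane Doe
TO: John Roe
SUBJECT: Budget review
Please find the draft attached.
"""
```

Calling `extract_metadata(SAMPLE)` yields the form name together with the date, author, addressee list and subject. Text whose element sequence does not derive from `Memo` returns `None`. Separating the element rules from the grammar mirrors the division between annotation and parsing in the method: new documentary forms would need new grammars but could reuse the same element annotations.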
Journal title:
- IJDC
Volume 5, Issue
Pages -
Publication date: 2010